Charles Minard’s Map of Napoleon’s Invasion of Russia:
Grab those dependencies:
install.packages(c("ggplot2", "RColorBrewer", "scales"),
repos='http://r.adu.org.za/')
library(ggplot2)
library(scales)
library(grid)
library(RColorBrewer)
R makes loading the data very easy for us. Just a single line and we can work with it.
df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
That line turns the CSV into a data frame, which is sort of like a table in R. You can even pass a URL instead of a file name and R will download the file. header=T tells R that the first line of the file holds the column names.
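To make the effect of the header argument concrete, here is a minimal sketch that uses a tiny stand-in file rather than the real dataset. The file contents and the title column are invented for illustration; only listicle_size and num_fb_shares are column names taken from this article.

```r
# Toy stand-in for the real CSV: one header row plus three records.
# The "title" column and all values are hypothetical.
csv_lines <- c(
  "title,listicle_size,num_fb_shares",
  "Example A,10,1200",
  "Example B,25,45000",
  "Example C,10,300"
)
toy_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, toy_path)

# header = TRUE: the first line supplies the column names.
toy <- read.csv(toy_path, header = TRUE)
str(toy)

# header = FALSE: the first line is treated as data and the
# columns get default names (V1, V2, V3).
raw <- read.csv(toy_path, header = FALSE)
names(raw)
nrow(raw)
```

Spelling out TRUE instead of T is slightly safer, since T is an ordinary variable that can be reassigned.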
library(ggplot2)
library(scales)
library(grid)
library(RColorBrewer)
df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(listicle_size)) + geom_histogram(binwidth=1)
plot
We pass the data frame to ggplot. We then specify the aesthetics: listicle_size is a column in the data frame (R knows that from the header row of the CSV), and because it is the first aesthetic given, ggplot puts it on the x-axis.
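The mapping described above can be checked on a small made-up data frame. This is only a sketch: the values are invented, and layer_data() is used here simply to peek at the bin counts ggplot2 computes for the histogram layer.

```r
library(ggplot2)

# Hypothetical toy data; the column name mirrors the one used in the article.
toy <- data.frame(listicle_size = c(5, 10, 10, 10, 15, 15, 21))

# Naming the aesthetic explicitly is equivalent to aes(listicle_size).
p <- ggplot(toy, aes(x = listicle_size)) + geom_histogram(binwidth = 1)

# layer_data() returns the data computed for a layer, here one row per bin.
bins <- layer_data(p, 1)
bins[bins$count > 0, c("x", "count")]

# The same tally with base R, for comparison.
table(toy$listicle_size)
```

With binwidth=1 each bar corresponds to a single listicle size, which is why the histogram reads like a count per number of entries.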
library(ggplot2)
library(scales)
library(grid)
library(RColorBrewer)
source("fte-theme.R")
df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(listicle_size)) +
geom_histogram(binwidth=1) +
fte_theme()
plot
So that looks a bit better. But I think we can still add a bit more.
Let’s give it some axis titles, a nice heading, and fit a few more breaks along the x and y axes. We’ll also add a touch of color and transparency. I’ve omitted some of the imports for brevity.
df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(listicle_size)) +
geom_histogram(binwidth=1, fill="#bc70e7", alpha=0.75) +
fte_theme() +
labs(title="Distribution of Listicle Sizes for BuzzFeed Listicles",
x="# of Entries in Listicle",
y="# of Listicles") +
scale_x_continuous(breaks=seq(0,50, by=5)) +
scale_y_continuous(labels=comma) +
geom_hline(yintercept=0, size=0.4, color="black")
plot
The amazing ggplot2 can also do scatter plots. We can give it all 15 101 points to plot and it’ll happily do that for us. Try that in Excel.
df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(x=listicle_size, y=num_fb_shares)) +
geom_point()
plot
You may be able to read that graph, but it has one big issue. A few listicles in the dataset have over 1 000 000 shares, and that squashes the rest of the points really close to the bottom. But we can fix this with a log scale!
df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(x=listicle_size, y=num_fb_shares)) +
geom_point(alpha=0.05) +
scale_y_log10(labels=comma)
plot
Also note that I’ve given each point an alpha value of 0.05. This means that each point is only 5% opaque (or 95% transparent), so places in the plot where many points overlap show up darker.
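As a rough worked example of why overlapping points look darker, assume standard "over" alpha compositing of identically colored points (an assumption about the graphics device, not something ggplot2 promises). Under that model, n stacked points with alpha a have a combined opacity of 1 - (1 - a)^n.

```r
# Combined opacity of n identical points drawn on top of each other,
# assuming standard "over" compositing (a modelling assumption).
alpha <- 0.05
n_overlapping <- c(1, 5, 14, 50, 100)
combined_opacity <- 1 - (1 - alpha)^n_overlapping

data.frame(n_overlapping,
           combined_opacity = round(combined_opacity, 3))

# Smallest number of overlapping points that reaches 50% opacity.
n_half <- min(which(1 - (1 - alpha)^(1:200) >= 0.5))
n_half
```

Under this model a lone point is barely visible, while it takes about 14 overlapping points to pass 50% opacity, so dense regions stand out clearly.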
Of course! Add our theme again and some axis titles.
df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(x=listicle_size, y=num_fb_shares)) +
geom_point(alpha=0.05) +
scale_y_log10(labels=comma) +
fte_theme() +
labs(x="# of Entries in Listicle",
y="# of Facebook Shares",
title="FB Shares vs. Listicle Size for BuzzFeed Listicles")
plot
df <- read.csv("buzzfeed_linkbait_headlines.csv", header=T)
plot <- ggplot(df, aes(x=listicle_size, y=num_fb_shares)) +
geom_point(alpha=0.05, color="#bc70e7") +
scale_x_continuous(breaks=seq(0,50, by=5)) +
scale_y_log10(labels=comma, breaks=10^(0:6)) +
geom_hline(yintercept=1, size=0.4, color="black") +
geom_smooth(alpha=0.25, color="black", fill="black") +
fte_theme() +
labs(x="# of Entries in Listicle",
y="# of Facebook Shares",
title="FB Shares vs. Listicle Size for BuzzFeed Listicles")
plot
We’ve also added a smoothed line of best fit (with its confidence band) using geom_smooth().
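geom_smooth() chooses its own smoother by default, so the curve is not necessarily a straight line. If a literal straight line of best fit is wanted, one option is a linear model on the log10 scale, which is what geom_smooth(method="lm") would draw on a log axis. The sketch below does this by hand with base R on simulated data; the numbers are invented and say nothing about the real BuzzFeed dataset.

```r
# Simulated stand-in data; values are hypothetical.
set.seed(1)
toy <- data.frame(listicle_size = sample(5:50, 200, replace = TRUE))
toy$num_fb_shares <- pmax(
  1,
  round(10^(2 + 0.02 * toy$listicle_size + rnorm(200, sd = 0.5)))
)

# Straight-line fit on the log10 scale, matching a log10 y-axis.
fit <- lm(log10(num_fb_shares) ~ listicle_size, data = toy)

# Fitted values with a 95% confidence interval for the mean.
new_sizes <- data.frame(listicle_size = c(10, 25, 40))
pred <- predict(fit, newdata = new_sizes,
                interval = "confidence", level = 0.95)

# Back-transform from log10 to share counts for readability.
cbind(new_sizes, round(10^pred))
```

The lwr and upr columns play the role of the shaded band in the plot: they bound the fitted mean, not individual listicles.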